How does Ollama run? What are the pieces? What happens when you ask the model a question?
I have seen these and variations of these questions on the Ollama Discord, in the comments on my channel's videos, and even on my Twitter, where I am technovangelist, and on GitHub, where I am also technovangelist.
So let’s look at all of this in a bit more detail.
As of the recording of this video, Ollama is running on three platforms: Linux, Mac and Windows.
So for those three platforms, there is a single supported installation method for each. On Linux, there is an installation script that you can find on the site, and on Mac and Windows there is an installer. What do they do? Well, let’s look at the script for Linux. There is a lot here that is just dealing with CUDA drivers. In fact, in the 266-line shell script, well over 150 lines are just dealing with Nvidia. The rest of the script copies the binary into the right spot, sets up a new user and group so that the service doesn’t run as you, and then sets up the service using systemctl and ensures that it stays running. The Mac and Windows apps are a little different, but they end up with the same basic result: there is a binary that runs everything, and there is a service running in the background using that same binary.
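As a rough sketch, the service setup part of the Linux script does something along these lines (the exact paths, user options, and unit settings in the real script may differ):

```sh
# Sketch only: copy the binary into place and create a dedicated user
sudo install -m 755 ollama /usr/local/bin/ollama
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama

# A minimal systemd unit so the server runs in the background and restarts if it dies
sudo tee /etc/systemd/system/ollama.service >/dev/null <<'EOF'
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always

[Install]
WantedBy=default.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama   # start now and keep it running across reboots
```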
There is only a single binary, but it can run as the server or as the client, depending on the arguments given. To work with Ollama, there is always a server and there is always a client. That client can be the CLI or another application someone has written using the REST API. When you run ollama run llama2,
you are running the interactive CLI client. And the client doesn’t actually do anything other than pass the request on to the server. Again, the server is running on your machine as well. We aren’t talking about any service running up in the cloud (unless YOU have set up a server to do that). The server, running in the background, takes that request and loads the model and lets the client know that it’s ready. Now you can interactively ask a question.
The same thing happens when you use the API to send a message to Ollama. The Ollama server loads the model, runs your prompt, and returns the answer to your API client. There is no need to run ollama run with the model name first if you are using the API. The CLI client is just another API client, just like the program you are writing.
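For example, here is roughly what that looks like if you talk to the local server yourself with curl (this assumes the server is on the default port 11434 and that llama2 has already been pulled):

```sh
# Ask the local Ollama server a question directly -- the same thing the CLI does for you
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```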
Now, I said all this runs locally. There are three exceptions to this. The first is when you have put the server on a remote machine. The other two are when you pull or push a model. In that case, a model is either being downloaded from or uploaded to the ollama.com registry. So that opens up another question: does Ollama use my questions to improve the model, and does that get uploaded to ollama.com when I push the model? That is of course one of the big concerns with using ChatGPT and other online models. Often your interactions with them go back into making the model better. Asking a question and getting an answer out of the local model with Ollama can take a while, but that amount of time is nothing compared to the time required to fine-tune or train the model using your data. You would hear the fans whirring hard for a good long time if Ollama were able to do that. Ollama has no ability to fine-tune a model today, so when you push a model, none of your questions and answers are added to the model…for the most part. I’ll talk more about an exception there in a bit.
Now, some folks hate that there is a service for running models sitting in the background all the time. It’s going to take memory, right? Models are big, and they shouldn’t stay in memory, right?
The memory consumed by the service is whatever is needed by the model while it is running. Then Ollama will eject that model after 5 minutes, although that is configurable. At that point, it drops to a minimal memory footprint. There isn’t really much reason to stop it at that point, but if you feel strongly that it shouldn’t run, then here is how to do it.
On Mac, come up here to the menu bar, click the Ollama icon, and then choose Quit Ollama. On Linux, run systemctl stop ollama at the command line. And on Windows, come down here to the tray icon and choose Quit Ollama. If you just go and kill the processes, they will restart and you will get frustrated. If you are on a Linux distro that doesn’t use systemd, then you either installed it manually or you used a community-created install. There are a few of those, and I can’t really suggest the right way there. To get it started again, launch the Ollama app on Mac or Windows, and on Linux run systemctl start ollama.
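So on a standard Linux install, stopping and starting the background server looks like this (sudo may or may not be needed depending on how you installed):

```sh
sudo systemctl stop ollama     # stop the background server
sudo systemctl status ollama   # confirm it is no longer running
sudo systemctl start ollama    # bring it back when you want it again
```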
Some folks hate the fact that Ollama gets rid of the model after 5 minutes. They either want that time to be shorter or longer. You can set the time using the keep_alive API parameter. If you set keep_alive to -1, Ollama will keep the model in memory forever, or you can specify the seconds, minutes, or hours. For now, as of this recording, this has to be done in the API and not via the CLI or environment variables.
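As a quick sketch, here is what that might look like in a generate request with curl (the prompt is just an example; -1 means never unload):

```sh
# Keep llama2 loaded indefinitely after this request completes
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "keep_alive": -1,
  "stream": false
}'
```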
Remember a moment ago I said “for the most part” with regards to adding your questions and answers to the model? There actually is an exception to this. But it’s a pretty special case. Say I do ollama run llama2 and then ask it a few questions: “Why is the sky blue?”, “Why is the wavelength longer or shorter?”, and “Can I surf those waves?”… It’s an example, give me a break. If I then run /save m/mattwaves (where m is the name of my namespace on Ollama.com), I have saved those messages as part of the model. The messages aren’t in the model weights file, though. I’ll come back to that in a sec.
So let’s take a look at how this works. First, the manifest for this model. Here we can see there are a few layers, and this one is called messages. Let’s open the blob file it mentions. See the questions? They are all there along with the answers. So these are just like the messages you would set if you used the chat API. For instance, if I want to use a few-shot prompt showing the model how I expect it to provide a JSON-formatted object, I would add the messages here. This is available in the Modelfile as well, using the MESSAGE instruction, and those messages appear here as this messages layer. So you might think, based on this, that you could just edit the messages in this file and it would replicate the behavior we saw in the Modelfile or in the API. That will continue to work locally, but as soon as the model was pushed, the digests would no longer match and it wouldn’t work. So if you want to achieve this with the CLI, you will have to create the Modelfile and update the messages there. Again, pretty special circumstances.
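As a rough sketch, a Modelfile that bakes a few-shot exchange into a model might look something like this (the model name, namespace, and messages here are just examples):

```sh
# Modelfile -- the MESSAGE instructions become the messages layer in the manifest
cat > Modelfile <<'EOF'
FROM llama2
SYSTEM You answer every question with a single JSON object.
MESSAGE user Why is the sky blue?
MESSAGE assistant {"answer": "Because shorter blue wavelengths scatter more in the atmosphere."}
EOF

ollama create m/mattwaves -f Modelfile   # build the new model locally
# ollama push m/mattwaves                # optionally share it, messages and all
```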
When we looked at the manifest, we focused on the messages layer, but there are a few other layers as well. You can see one for the system prompt, the template, and others. One of them is the model weights file. That’s the really big file for each model. Notice that the name of each file is the sha256 digest of that file. When Ollama gets the manifest, it looks to see if the corresponding files are already on the system. If it already has a file, it doesn’t download it. So you might have a model llama2 and then another model called mycoolmodel that someone has created based on llama2 with a unique system prompt. When you pull that model, the model weights file will already be there, so adding the new model will have minimal impact on the space consumed on your drive. Similarly, removing llama2 at this point will have minimal impact on your drive because mycoolmodel is still using that model weights file.
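You can check that naming scheme for yourself. A quick way, assuming the default ~/.ollama location (the exact blob naming can vary between versions), is to hash a blob and compare it to its filename:

```sh
cd ~/.ollama/models/blobs
for f in sha256-*; do
  # the printed hash should match the digest in the file's own name
  echo "$f"
  sha256sum "$f" | cut -d' ' -f1     # on a Mac, use: shasum -a 256 "$f"
done
```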
I think that’s all the info on how the pieces work together and how to use Ollama. If you have any questions about this or anything else, let me know in the comments below. Thanks so much for watching.
Shorts - 220wpm #
server vs client in ollama #
…you just need to remember, with ollama there is a server and there is a client. But on all 3 supported platforms there is a single executable. You can manually run the server with the command ollama serve
but on Windows, Mac, and Linux, there is a service that runs automatically. At first the client is going to be the ollama CLI. With that, you can run any model that you find on Ollama.com and interactively ask the model any questions. But the client can also be a VS Code extension like Twinny, an Emacs or Neovim plugin like Ellama, an Obsidian plugin like Copilot, or a desktop app like MindMac. Or it could be whatever application you are writing using the REST API, the official JS or Python libraries, or any of the community-supported libraries for Rust, Elixir, Ruby, Swift, Dart, and so many others. So you just need to remember, with ollama…
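A tiny sketch of that split, assuming llama2 is already pulled and nothing else is bound to the default port:

```sh
ollama serve &                        # the server role (normally the background service does this)
ollama run llama2                     # client #1: the interactive CLI
curl http://localhost:11434/api/tags  # client #2: any program hitting the REST API
```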
3 times local isn’t local #
Normally everything in Ollama runs local to your computer. You never need to touch the outside world, and your data will never be shared…mostly. There are three exceptions. Pushing a model means you can share your customized model with the world. In most cases, this is just going to be someone else’s model weights with your system prompt and template. When you pull a model, you are accessing ollama.com to get a new model downloaded to your system. But there is a third exception which is interesting. If you run any model, then ask it a series of questions and get some answers, and then you run /save and the name of the model, those questions and answers get saved with the model. You can then push that model to Ollama.com, and anyone who pulls that model will see those questions and answers. It’s not much of a risk, but something you should know. But normally everything in Ollama…
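Here is the shape of that flow, with m/mysavedchat standing in for whatever name you choose in your own namespace:

```sh
ollama pull llama2            # network: downloads the model from ollama.com
ollama run llama2             # local: questions and answers stay on your machine...
# >>> why is the sky blue?
# >>> /save m/mysavedchat     # ...until you save them into a model of your own
ollama push m/mysavedchat     # network: uploads the model, saved messages included
```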
Stopping and starting ollama service #
Ollama runs a server and a client, with the server running as a service in the background on Mac, Windows, and Linux. And so there are three corresponding ways to start and stop the Ollama service. Mac and Windows are pretty similar. On Mac you click the llama in the menu bar and choose to quit Ollama. On Windows you click the llama in the task tray and choose to quit Ollama. In both cases you start the Ollama service by double-clicking the icon. If you installed it in an unofficial way, like with brew, that may be different. On Linux you run systemctl stop ollama to stop and systemctl start ollama to start. If you are on one of the unofficial platforms and installed with a package manager, then it may be different. But no matter what, Ollama runs a server and a client…
keep_alive #
When you enter the command ollama run and then a model name in the CLI client, or you launch the model with any API client, the model stays in memory for 5 minutes and is then unloaded. But using the API you can change this with the keep_alive parameter. Add it to the body when you do a generate or chat. Here you can see an example using curl, as well as the official Python and JavaScript libraries. After keep_alive, specify a time. Just a number will be seconds, but if you also add an s, an m, or an h, it will be that many seconds, minutes, or hours. This uses the Go time library, which accepts units such as “ns”, “us” (or “µs”), “ms”, “s”, “m”, and “h”. When you use keep_alive you can be sure that the model will stay in memory for the right amount of time for your use case. Otherwise when you enter the command ollama run and then…
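For instance, a chat request that keeps the model around for half an hour might look like this with curl (the 30m value and the question are just examples):

```sh
curl http://localhost:11434/api/chat -d '{
  "model": "llama2",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "keep_alive": "30m",
  "stream": false
}'
```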
layers of the model #
Many of Ollama’s creators’ roots are in Docker. This shows when you look at a model in detail. In Ollama, a model is actually made of many layers, just like a Docker image. You can see these layers when you open the manifest, which you can find under .ollama/models/manifests. Here is a manifest for llama2 7b. You can see there is a layer for the model weights, which is a huge file, and then layers for the system prompt and other things. Here is a manifest for m/myllama2copy, which is based on llama2 7b. I specified my own system prompt and temperature, so while some of the layers have the same sha256 digest, others are different. These sha256 strings correspond with the filenames in .ollama/models/blobs, and if I open the file that corresponds with myllama2copy’s system prompt, I can see the value of the system prompt. If a file with the name of the digest already exists on the filesystem, any new model that you pull from ollama.com that uses that layer will leverage the file already on the system. Ollama’s layered model format is really easy to deal with, especially when you realize that many of Ollama’s creators’ roots are in Docker.
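If you want to poke around yourself, something like this works on a default install (the exact directory layout and blob naming may differ between versions, and jq is optional but handy):

```sh
# list the manifests that are on disk
ls -R ~/.ollama/models/manifests/

# show the layers (media type and digest) for the llama2 manifest
jq '.layers[] | {mediaType, digest}' \
  ~/.ollama/models/manifests/registry.ollama.ai/library/llama2/latest

# the digests map to filenames in the blobs directory
ls -lh ~/.ollama/models/blobs/
```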